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Abstract 

Particle Markov chain Monte Carlo tech¬ 
niques rank among current state-of-the-art 
methods for probabilistic program inference. 
A drawback of these techniques is that they 
rely on importance resampling, which results 
in degenerate particle trajectories and a low 
effective sample size for variables sampled 
early in a program. We here develop a for¬ 
malism to adapt ancestor resampling, a tech¬ 
nique that mitigates particle degeneracy, to 
the probabilistic programming setting. We 
present empirical results that demonstrate 
nontrivial performance gains. 


1 Introduction 

Probabilistic programming languages extend tradi¬ 
tional languages with primitives for sampling and con¬ 
ditioning on random variables. Running a program F 
generates random variates x for some subset of pro¬ 
gram expressions, which can be thought of as a sample 
from a prior p(x \ F) implicitly defined by the program. 
In a conditioned program F[y], a subset of expressions 
is constrained to take on observed values y. This de¬ 
fines a posterior distribution p(x \ F[y ]) on the random 
variates x that can be generated by the program F[y], 

Probabilistic programs are in essence procedural repre¬ 
sentations of generative models. These representations 
are often both succinct and extendable, making it easy 
to iterate over alternative model designs. Any program 
that always samples and constrains a fixed set of vari¬ 
ables {x,y} admits an alternate representation as a 
graphical model. In general, a program F[y\ may also 
have random variables that are only generated when 
certain conditions on previously sampled variates are 
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met. The random variables x therefore need not have 
the same entries for all possible executions of F[y\. In 
languages that have recursion and higher-order func¬ 
tions (i.e. functions that act on other functions), it 
is straightforward to define models that can instanti¬ 
ate an arbitrary number of random variables, such as 
certain Bayesian nonparametrics, or models that are 
specified in terms of a generative grammar. At the 
same time this greater expressivity makes it challeng¬ 
ing to design methods for efficient posterior inference 
in arbitrary programs. 


In this paper we show how a recently proposed tech¬ 
nique known as particle Gibbs with ancestor sam¬ 
pling (PGAS) [Lindsten et al. 2012 can be adapted 
to inference in higher-order probabilistic programming 


systems [Mansinghka et al. 

2014 

Wood et al. 

2014 

Goodman et al. 

2008 . A PGAS implementation re- 


quires combining a partial execution history, or pre¬ 
fix, with the remainder of a previously completed ex¬ 
ecution history, which we call a suffix. We develop a 
formalism for performing this operation in a manner 
that guarantees a consistent program state, correctly 
updates the probabilities associated with each sam¬ 
pled and observed random variable, and avoids unnec¬ 
essary recomputation where possible. An empirical 
evaluation demonstrates that the increased statistical 
efficiency of PGAS can easily outweigh its greater com¬ 
putational cost. 


1.1 Related Work 


Current generation probabilistic programming systems 
fall into two broad categories. On the one hand, sys¬ 
tems like Infer.NET iMinka et al.l 120101 and STAN 


Stan Development Team 2014 restrict the language 


syntax in order to omit recursion. This ensures that 
the set of variables is bounded and has a well-defined 
dependency hierarchy. On the other hand languages 


such as Church Goodman et al. 

2008 , Venture Mans- 

inghka et al. 

2014 , Anglican Wood et al. 

2014 , and 

Probabilistic C Paige and Wood 

2014 do not impose 


such restrictions, which makes the design of general- 
purpose inference methods more difficult. 
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Arguably the simplest inference methods for prob¬ 
abilistic programming languages rely on Sequential 
Monte Carlo (SMC) |Del Moral et al. 2006 . These 
“forward” techniques sample from the prior by run¬ 
ning multiple copies of the program, calculating im¬ 
portance weights using the conditioning expressions as 
needed. The only non-trivial requirement for imple¬ 
menting SMC variants is that there exists an efficient 
mechanism for “forking” a program state into multiple 
copies that may continue execution independently. 


An expression e is either a constant literal c, a sym¬ 
bol s, an application (e &e) with operator e and zero 
or more arguments &e, a function literal (lambda (&s) 
e) with argument list (&s), an if-statement (if e e 
e), or a quoted expression (quote e). Each expression 
e evaluates to a value v upon execution. In addition 
to the self-explanatory bool, int, float, and string 
types, values can be primitive procedures (i.e. lan¬ 
guage built-ins such +, -, etc.), compound procedures 
(i.e., closures), and lists of zero or more values (&v). 


Another well-known inference technique for probabilis¬ 
tic programs is lightweight Metropolis-Hasting (LMH) 
Wingate et al. 2011 . LMH methods construct a 


database of sampled values at run time. A change to 
the sampled values is proposed, and the program is re¬ 
run in its entirety, substituting previously sampled val¬ 
ues where possible. The newly constructed database 
of random variables is then either accepted or rejected. 
LMH is straightforward to implement, but costly, since 
the program must be re-run in its entirety to evaluate 
the acceptance ratio. A more computationally effi¬ 


cient strategy is offered by Venture Mansinghka et al. 


2014 , which represents the execution history of a pro¬ 


gram as a graph, in which each evaluated expression is 
a node. A graph walk can then determine the subset 
of expressions affected by a proposed change, allowing 
partial re-execution of the program conditioned on all 
unaffected nodes. 


SMC and MH based algorithms each have trade-offs. 
Techniques that derive from SMC can be run very ef¬ 
ficiently, but suffer from particle degeneracy, resulting 
in a deteriorating quality of posterior estimates calcu¬ 
lated from values sampled early in the program. In 
MH methods subsequent samples are typically corre¬ 
lated, and many updates may be needed to obtain an 
independent sample. As we will discuss in Section [3j 
PGAS can be thought of as a hybrid technique, in the 
sense that SMC is used to generate independent up¬ 
dates to the previous sample, which mitigates the de¬ 
generacy issues associated with SMC whilst increasing 
mixing rates relative to MH. 


2 Probabilistic Functional Programs 


The stochastic type represents stochastic processes, 
whose samples must either be i.i.d. or exchangeable. 
A stochastic process sp supports two operations 

(sample sp) -> v 
(observe sp v) -> v 

The sample primitive draws a value from sp, whereas 
the observe primitive conditions execution on a sam¬ 
ple v that is returned as passed. Both operations as¬ 
sociate a value v to sp as a side-effect, which changes 
the probability of the execution state in the inference 
procedure. For exchangeable processes such as (crp 
1.0) this also affects the probability of future samples. 

Following the convention employed by Venture and 
Anglican, we define programs as sequences of three 
types of top-level statements 

F : := t F | END 

t [assume s e] | [observe e v] | [predict e] 

Variables in the global environment are defined using 
the assume directive. The observe directive conditions 
execution in the same way as its non-toplevel equiv¬ 
alent. The predict directive returns the value of the 
expression e as inference output. 

2.2 Importance Sampling Semantics 

In many probabilistic languages the (observe e v) 
form is semantically equivalent to imposing a rejection- 
sampling criterion. For example, a program subject to 
(observe (> a 0) true) can be interpreted as a rejec¬ 
tion sampler that repeatedly runs the program and 
only returns predict values when (> a 0) holds. 


2.1 Language Syntax 

For the purposes of exposition we will consider a sim¬ 
ple Lisp dialect, extended with primitives for sampling 
and observing random variables 

e ::= c | s | (e &e) | (lambda (&s) e) 

| (if e e e) | (quote e) 
v bool | int | float | string | primitive 
| compound | (&v) | stochastic 


The semantics of observe that we have defined here 
imply an interpretation of a probabilistic program as 
an importance sampler, where sample draws from the 
prior and observe assigns an importance weight. In¬ 
stead of constraining the value of an arbitrary expres¬ 
sion e, observe conditions on the value of (sample e). 
This restricted form of conditioning guarantees that we 
can calculate the likelihood p(v | (sample e)) for every 
observe, as long as we implement a density function 
for all possible stochastic values in the language. 
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More formally, we use the notation F[y] to refer to 
a program conditioned on values y via a sequence of 
top-level [observe e v] statements. Execution of F[y\ 
will require the evaluation of a number of (sample e) 
expressions, whose values we will denote with x. We 
now informally define F[y,x\ as the program in which 
all (sample e) expressions are replaced by conditioned 
equivalents (observe e v), resulting in a fully deter¬ 
ministic execution. Similarly we can define F[x] as the 
program obtained from F[y, x] by replacing [observe 
e v] statements with unconditioned forms [assume s 
(sample e)] with a unique symbol s in each statement. 

Whereas top-level observe statements are fixed in 
number and order, F may not evaluate the same com¬ 
bination of sample calls in every execution. To provide 
a more precise definition of F[x], we associate a unique 
address a with each sample call that can occur in the 
execution of F. We here use a scheme in which the 
run-time address of each evaluation is a concatenation 
a'::(t,p) of the address a! of the parent evaluation and 
a tuple (t,p) in which t identifies the expression type 
and p is the index of the sub-expression within the 
form. This particular scheme labels every evaluation, 
not just the sample calls, allowing us to formally rep¬ 
resent F : A —► £ as a mapping from addresses A to 
program expressions £. We represent x : S —> V as a 
mapping from a subset of addresses 5c^ associated 
with sample calls to values V. Similarly, y : O —> V 
is a mapping from addresses O C A associated with 
observe statements to values. With these definitions 
in place, a conditioned program F[x\ : A —> £ simply 
replaces F(a) = (sample e) with 

F[x\(a) = (observe e x(a)), Va € S. 

The importance sampling interpretation of a program 
F[y] W, x (read as F[x] yields W,x) is defined in 
terms of random variables x and a weight W 

F[y ] W, x W = p(y | F[a;]) . 

By definition, the generated samples x are drawn from 
the prior p(x \ F). The weight W = p(y | F[a;]) is the 
joint probability of all top-level observe statements in 
F[y], Repeated execution of F[y\ yields a weighted 
sample set {W l ,x 1 } that may be used to approximate 
the posterior as 

p{x\F[y}) ~p L (x\F[y}) = ^ . 5 X ‘. 

i 

More generally the importance weight is defined as 
the joint probability of all observe calls (top-level and 
transformed) in the program 

F[y,x]^>W W = p(y, x \ F), 

F[x]^W,y W = p{x\F[y]), 

F~^W,y,x W = 1. 


3 Particle MCMC methods 


3.1 Sequential Monte Carlo 


SMC methods are importance sampling techniques 
that target a posterior p(x | y) on as space X by per¬ 
forming importance sampling on unnornralized densi¬ 
ties {7n(®n)}^! = i defined on spaces of expanding di¬ 
mensionality {X n }n = i, where each X n C X n+ i and 
T/v = X. This results in a series of intermediate parti¬ 
cle sets {w l n , x l n }^ =11 which we refer to as generations. 
Each generation is sampled via two steps, 

~ R(u n | rr n ), x n ~ p n {x n | x n _^). 

Here R(a \ w) is a resampling procedure that returns 
index a = l with probability w l /^Tj l iW l and p n is a 
transition kernel. The samples x l n are assigned weights 

l = 

71 / a ‘„ \ //I a n \ ’ 

”/n—l{ x n -i)Pn\ x n \ x n -i) 


In the context of probabilistic programs, we can define 
a series of partial programs F n [y n ] that truncate at 
each top-level [observe e v] statement. We can then 
sequentially sample F n [y n ,x n _ i] W n ,x n by par¬ 
tially conditioning on x n -i at each generation. Since 
x n is a sample from the prior, this results in an im¬ 
portance weight [Wood et al. 2014 


Wr, = 


P(Vn\ FnWnP 

P(y n - 1 \ F n- 1 [Xn-l]) 


Note that this is simply the likelihood of the n-tli 
top-level observe. In practice we continue execu¬ 
tion relative to F„_i [y n _i, x n -i] to avoid rerunning 
F n [y n , £c„_i] in its entirety. The means that the infer¬ 
ence backend must include a routine for forking mul¬ 
tiple independent executions from a single state. 


3.2 Iterative Conditional SMC 


An advantage of SMC methods is that they provide 
a generic strategy for joint proposals in high dimen¬ 
sional spaces. An importance sampling scheme where 
F[y} W, x draws from the prior has a vanishingly 
small probability of generating a high-weight sample. 
By sampling the smallest possible set of variables x l n at 
each generation and selecting ancestors a l n+1 accord¬ 
ing to the likelihood of the next observed data point, 
we ensure that x l n+1 is sampled conditioned on high- 
weight values of x n from the previous generation. At 
the same time this strategy has a drawback: each time 
the particle set is resampled, the number of unique 
values at previous generations decreases, typically re¬ 
sulting in coalescence to a single common ancestor in 


0(L log L) generations Jacob et al. 2013. 
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In many applications it is not practically possible (due 
to memory requirements) to set I to a value large 
enough to guarantee a sufficient number of indepen¬ 
dent samples at all generations. In such cases, par¬ 


ticle variants of MCMC techniques Andrieu et al. 


2010 can be used to combine samples from multiple 


SMC sweeps. An iterated conditional SMC (ICSMC) 
sampler repeatedly selects a retained particle k with 
probability w^/Yhk' w Ni an d then performs a condi¬ 
tional SMC (CSMC) sweep, where the resampling step 
is conditioned on the inclusion of the retained particle 
at each generation. Formally, this procedure is a par¬ 
tially collapsed Gibbs sampler that targets a density 
4 >(x, a, k) on an extended space 




i -L 1 -l 

Ni a 2:Ni 
N L 


k) = 


W 


N 


St'' 




J N l=i 


„= 2 r=i&<-i 


We use the shorthand x k = x\]'£ N and a k = to 

refer to the sampled values and ancestor indices of the 
retained particle, whose index b n at each generation 
can be recursively defined via bpf = k and 6„_i = a„". 
The notation x~ k and a~ k refers to the complements 
where the retained particle is excluded. An ICSMC 
sampler iterates between two updates 


1. {x*’~ k , a *" fc } ~ <j>{x ~ k , a~ k | x k ,a k , k) 

2. k* ~ 4 , )(k\x*’- k ,a*- k 1 x k J a k ) 


If we interpret x~ k , a and k as auxilliary variables, 
then the marginal on leaves the density 7 n(xn) 
invariant 


Andrieu et al. 2010 


The advantage of ICSMC samplers is that the target 
space is iteratively explored over subsequent CSMC 
sweeps. A disadvantage is that consecutive sweeps 
often yield partially degenerate particles, since many 
newly generated particles will coalesce to the retained 
particle with high probability. For this reason ICSMC 
samplers mix poorly when the number of particles is 
not large enough to generate at least two completely 
independent lineages in a single CSMC sweep. 


3.3 Particle Gibbs with Ancestor Sampling 


PGAS is a technique that augments the CSMC sweep 
with a resampling procedure for the index a*" of the 
retained particle Lindsten et al. 2012 . At a high 


level, this sampling scheme performs two updates 


1. {x*’ k ,a*}~(/)(x k ,a\x k ,k) 

2. k* ~ <f>{k\x*'f~’ k ,a*~ k ,x k ,a k ) 


Here update 1 differs from the normal CSMC update 
in that it samples a complete set of ancestor indices 
a*, not the complement to the retained indices a*’~ k . 
At each generation the ancestor a*" of the retained 
particle is resampled according to a weight 

w l n -i\ N = w l n _ 1 p{y N ,x k N [x l n ]\F N [y n ,x l n ]). 

Here Xff[x l n ] denotes the substitution of x l n into x k N , 
which is defined as a mapping x^ [x l n ] : A k N U A l n —> V 
where entries in x l n : A l n —> V augment or replace 
entries in x k N : A h N —> V. 

Intuitively, ancestor resampling can be thought of as 
proposing new program executions by complementing 
random variables x l n of a partial execution, or prefix, 
with retained values from x k N to specify the future of 
the execution, or suffix. This step is performed at each 
generation, allowing the retained lineage to potentially 
be resampled many times in the course of one sweep. 

Incorporation of the ancestor resampling step results 
in the following updates at each n = 2,..., N 

la. Update the retained particle 

a* n ’ bn ~R(a n |<_ 1|JV ) 

lb. Update the particles for l € {1,..., L} \ b n 

a n ’ 1 ~ R( a n | w n-l) 

X*J ~p{x n |-Fn ]) 

4 Rescoring Probabilistic Programs 

PGAS for probabilistic programs requires calculation 
of p(y N ,x%[x l n ] | F[y n ,x l n ]). This probability is, by 
definition, the importance weight of F]sr[y N ,x%[x l n ]\ 
and can therefore in principle be obtained by exe¬ 
cuting this re-conditioned form. The main drawback 
of this naive approach is that it requires LN eval¬ 
uations of the program in its entirety. This results 
in an 0(LN 2 ) computational cost, which quickly be¬ 
comes prohibitively expensive as the number of gen¬ 
erations N increases. A second complicating factor 
is that naively rewriting part of the execution history 
of the program may not yield a set of random values 
XN[x l n ] that could be generated by running Fiy[y N ]. 

We here develop a formalism that allows regeneration 
of a self-consistent program execution starting from a 
partial execution with random variables x l n , assuming 
all future random samples are inherited from retained 
values x 1 ^. To do so we introduce the notion of a 
trace, a data structure that annotates each evaluation 
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with information necessary to re-execute the expres¬ 
sion relative to a new program state. Given a trace 
for the suffix, i.e. the remaining top-level statements 
in a program, it becomes possible to re-execute con¬ 
ditioned on future random values, in a manner that 
avoid unnecessary reconrputation where possible. 

4.1 Intuition 

In higher order languages with recursion and memo- 
ization, regeneration is complicated by two factors: 

1. The program FN[y N ,x^[x l n ]} may be undercon¬ 
ditioned, in the sense that x^[x l n } does not con¬ 
tain values for some sample calls that can be eval¬ 
uated in its execution. It can also be overcondi¬ 
tioned, when x^[x l n ] contains values for sample 
calls that will never be evaluated. 

2. The expression for the stochastic argument to 

each observe in F N [y N , may need to be 

re-evaluated if it in some way depends on global 
variables defined in F n [y n ,x l n ]. Because each 
variable may in turn reference other variables, we 
must be able to reconstruct the program environ¬ 
ment recursively in order to rescore each observe. 

The first non-triviality arises from the existence of if 
expressions. As an example, consider the program 

0: [assume random? (sample (flip-dist 0.5))] 

1: [assume mu (if random? 

(sample (normal-dist 0.0 1.0)) 

0 . 0 )] 

2: [observe (normal-dist mu 1.0) 0.1] 

The sample expression in line 1 is only evaluated when 
the sample expression in line 0 evaluates to true. More 
generally, any program that contains (sample e) inside 
an if expression will not be guaranteed to instantiate 
the same random variables in cases where the predicate 
of the if expression itself depends on previously sam¬ 
pled values. In programming languages that lack re¬ 
cursion we have the option of evaluating both branches 
and including or excluding the associated probabilities 
conditioned on the predicate value. This is essentially 
the strategy that is employed to handle if expressions 
in Infer.NET and BUGS variants. In languages that 
do permit recursion, this is in general not possible. For 
example, the following program would require evalua¬ 
tion of an infinite number of branches: 

0: [assume geom (lambda (p) 

(if (sample (flip-dist p)) 

1 

(+ 1 (geom p))))] 

1: [observe (poisson-dist (geom 0.5)) 3] 

In other words, we cannot in general pre-evaluate the 
values associated with both branches in the suffix. 


When a predicate in the suffix no longer takes on the 
same value, we have a choice of either rejecting the 
regenerated suffix outright, or updating it using a re¬ 
generation procedure that evaluates the newly chosen 
branch and removes reference to any values sampled in 
the invalidated branch. We here consider the former 
strict form of regeneration, which guarantees that the 
regenerated suffix references precisely the same set of 
sample values as before. 

A second aspect that complicates rescoring is the ex¬ 
istence of menroized procedures. As an example, con¬ 
sider the following infinite mixture model 

0: [assume class-prior (crp 1.0)] 

1: [assume class 

(mem (lambda (n) 

(sample class-prior)))] 

2: [assume class-dist 
(mem (lambda (k) 

(normal-dist 

(sample (normal-dist 0.0 1.0)) 

1 - 0 )))] 

3: [observe (class-dist (class 0)) 2.1] 

4: [observe (class-dist (class 1)) 0.6] 

N+3: [observe (class-dist (class N)) 1.2] 

Here each observe makes a call to class, which sam¬ 
ples an integer class label k from a Chinese restaurant 
Process (CRP). The call to class-dist either retrieves 
an existing stochastic value, or generates one when a 
new value k is encountered. This type of memoization 
pattern allows us to delay sampling of the parameters 
until they are in fact required in the evaluation of a 
top-level observe, and makes it straightforward to de¬ 
fine open world models with unbounded numbers of 
parameters. At the same time it complicates analy¬ 
sis when performing rescoring. Menroized procedure 
calls are semantically equivalent to lazily defined vari¬ 
ables. Programs that rely on memoization can there¬ 
fore essentially define variables in a non-deterministic 
order. A regeneration operation must therefore dy¬ 
namically determine the set of variables that need to 
be re-evaluated at run time. 

4.2 Traced Evaluation 

The operations that need to happen during a rescor¬ 
ing step are (1) the regeneration of a consistent set 
of global environment variables, which includes any 
bindings in the prefix, augmented with any bindings 
defined in the suffix (some of which may require re- 
evaluation as a result of changes to bindings in the 
prefix), (2) a verification that the flow control path in 
the suffix is consistent with the environment bindings 
in the prefix, and (3) the reconrputation of the proba¬ 
bilities of any sampled and observed values whose den¬ 
sity values depend on bindings in the prefix. 
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In order to make it possible to perform the above op¬ 
erations, we begin by introducing a set of annotations 
for each value v that is returned upon evaluation of 
an expression e. In practical terms, each value in the 
language is boxed into a data structure which we call a 
trace. We represent a trace r as tuple (v, e, l, p, w, cr, <fi ). 
v is the value of the expression, e is a partially evalu¬ 
ated expression, whose sub-expressions are themselves 
represented as traces. I is the accumulated log-weight 
of observes evaluated within the expression, p is a 
mapping {s —> r} from symbols to traces, containing 
the subset of the global environment variables that 
were referenced in the evaluation of e. w is a map¬ 
ping {a i ^ (r, v, l)}. It contains an entry at the ad¬ 
dress a of each observe that was evaluated in e and 
its sub-expressions. This entry is represented as a tu¬ 
ple (r, v, l) containing a trace of the first argument to 
the observe (which must be of the stochastic type), 
the observed value v, and the associated log-weight l. 
Similarly a is a mapping {a H > (r, v)} that contains 
an entry for each evaluated sample expression (which 
omits the associated log-weight). The last component 
4> is again a mapping {a K > r} that records all traces 
that appear as conditions in if expressions and thereby 
influence the control flow of program execution. 

We now describe the semantics of the traced evalua¬ 
tion (e,a, R, A) fj. r. The evaluation operator || re¬ 
turns the trace r of an expression e at address a, rel¬ 
ative to a global environment R (i.e. variables defined 
via assume statements) and local bindings A (i.e. vari¬ 
ables bound in compound procedure calls), resulting 
in a trace r = (v, e, l, p, w, a, <f>). We assume the im¬ 
plementation provides a standard evaluation function 
for primitive procedures ewo?(prim vi ... v n ) = v. We 
reiterate that the notation a::(t,p) denotes an evalu¬ 
ation address composed of a parent address a, a type 
identifier t and a sub-expression index p , where t is one 
of i for if, 1 for lambda, q for quote, a for applications, 
and b when evaluating compound procedure bodies. 

Constants c evaluate to: 

(c, a, R, A))| (c,c, 0.0, {},{}, {},{}) • 

We call this type of trace “transparent”, since it con¬ 
tains no references to other traces that may take on 
different values or probabilities in another execution. 

Symbol lookups s in the global environment return the 
value of s stored in R: 

(s, a, R, A) )| (v, s, 0.0, {s h> t}, {},{},{}) 
if R{ s) = t and r = (v, _, _, _, _, _) . 

Lookups from the local environment are inlined: 

(s, a, f?, A) (v, e, 0.0, p, w, cr, <f>) 

if A(s) = t and r = (v, e, l, p , w, a, (f>) . 


Calls to sample inherit annotations from the trace r 
passed as an argument, and add an entry in cr: 

((sample e), a, R, A) )| (v, v, l , p, u>, a[a (r, v)], (j>) 

if (e, a::(s, 0), R, A) )| r = (vi, _, l, p, w, a , <fi) and 
v is drawn from the stochastic process vi . 

Calls to observe inherit from r and add an entry in w: 
((observe ei V 2 ),a, R, A) 

-II (v 2 ,v 2 Ji +l 12 ,p 1 u)[a (r, v 2 , Zi 2 )], cr, (j>) 

if (ei,a::(o,0),i?, A) J| t = (vi, Zi, p, w, a, </>)) 
and Z 12 = £(vi,v 2 ) . 

Here £(vi, v 2 ) is used to denote the log-density of value 
v 2 relative to vi (which must be of type stochastic). 

Primitive procedure applications (primop ei e 2 ) eval¬ 
uate to eiicd(prim vi v 2 ) where v,; is the result of eval¬ 
uating ep. 

((prim ei e 2 ),a,R, A) 

-(1 (ewzZ(prim vi v 2 ), (prim n r 2 ), l, p, u, a, 4>) , 

if (ei, cc::(p, i), R, A) J| r,, and p, to, cr, and <j> are ob¬ 
tained by merging the corresponding components of 
the r,;, and the log-density l is the sum of the 

Application of a single-argument compound procedure 
(i.e. closure) ei leads to the evaluation of the body e 
of the procedure relative to the environments f?i,Ai 
in which the compound procedure was defined: 

((ei e 2 ),a,R,A) )| (v, (n t 2 ),1, p,w,a,</>) 

if 

(ei,a::(a, 0),R,A) )| n, 
n = ((lambda (s) e), I?i, Ai), 

(e 2 ,a::(a, 1),R, A) J| r 2 = (v 2 , 

(e,a::(b,0),i?i,Ai[s >->• r 2 ]) f| r 3 = (v, _), 

and l, p, u>, cr, </> are obtained by combining the corre¬ 
sponding components of the Ty The case with multiple 
arguments is defined similarly. 

Quote expressions (quote r) simply return r: 

((quote e), a, R, A) |! r if (e, a :: (q, 0), R, A) )| r . 

Finally, an if expression (if e ei e 2 ) returns either 
the result of evaluating ei or e 2 . We show only the 
case where the true branch is taken: 

((if e ei e 2 ), a, R, A) )| (v, e, l, p, u, a, (j>[a r]) 

if 

(e, a::(i, 0 ),R, A) )| r = (true, _, _, _, _, _) , 

(ei, a::(i, 1), f?, A) )| n = (v, e, _, _) , 

and l,p,u,a,<j> are obtained by combining the corre¬ 
sponding components of r and ti. The other case is 
that the value of r is false, and has the semantics 
similar to the one above. 
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4.3 Regeneration and Rescoring 

We now define an operation TZ(t , R) = t' that regener¬ 
ates a traced value relative to an environment R. This 
operation performs the following steps: 

1. Re-evaluate predicates: Compare r = </>(a) to 

t' = 77(t, R) for all a. Abort if t' and r have 
different values v. Otherwise update (j)[a r']. 

2. Re-score observe expressions and statements: Let 

(r, v, l ) = w(a), and t' = 7 Z(t, R). If t' and r have 
different values v' 0 and v 0 , recalculate l' = £(vq, v) 
and update uj[a —> Otherwise, update 

w[a -> (t',v,Z)]. 

3. Re-score samples: Let (r, v) = er(a), and t' = 
1Z(t,R). Calculate l' = £(t',v) and update 
co[a !->■ (t', v,/')]. 

4. Regenerate the environment bindings: For all 
symbols s that do not exist in R, let t’ = 
TZ(p(s),R) and update R[s >->■ r'] and p[s >->• r']. 
For existing symbols update p[s >->• i?(s)]. We for 
convenience assume that R is updated in place, 
though this may be avoided by having 1Z return 
a tuple (R' , t'). 

5. If any bindings were changed with new values in 
step 4, regenerate all sub-expressions in e to 
reconstruct e'. 

6. If e' was reconstructed in step 5, evaluate e' and 
update v to the result of this evaluation. 

Rescoring an individual trace may be performed as 
part of the regeneration sweep by calculating a differ¬ 
ence in log-density A l. This A l is the sum of all terms 
Z' — Z in step 2 and all terms l' in step 3, and any A V 
values return from recursive calls to TZ. 

We have omitted a few technical details in this high- 
level description. The first is that we build a map 
C = {t —>• t', ...} on a call to 7 Z, which is passed as 
an additional argument in recursive calls to TZ , effec¬ 
tively memoizing the computation relative to a given 
initial environment R. This reduces the computation 
on recursive calls, which potentially expand the same 
traces r many times as sub-expressions of e. 

4.4 Ancestor Resampling 

Given an implementation of a traced evaluator and 
a regenerating/rescoring procedure 7 Z, an imple¬ 
mentation for PGAS in probabilistic programs be¬ 
comes straightforward. We represent the programs 
F n [y n ,x l n ] evaluated up to the first n top-level state¬ 
ments as pairs {R l nl T l n ). We now construct a con¬ 
catenated suffix T n by recursively re-evaluating T n = 
(cons r^ 71 Tn+i)- For n = 1 we then define 
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t 

Figure 1: Effective sample size as a function of t , 
normalized by the number of PMCMC sweeps (top) 
and wall time in seconds (bottom). PGAS with 10 
particles is shown in cyan. ICSMC results are shown 
in blue, with shorter dashes representing increasing 
particle counts 10, 20, 50, 100, 200, and 500. All lines 
show the median ESS over 25 independent restarts. 

P(yN, x> h[ x n l ] \ F N[y n i x n]) as the log-weight of the 
rescored trace 72.(7^, R l n -i)- Note here that by con¬ 
struction, we may extract T n +i from T n without addi¬ 
tional computation. 

5 Experiments 

To evaluate the mixing properties of PGAS relative to 
ICSMC, we consider a linear dynamical system (i.e. 
a Kalman smoothing problem) with a 2-dimensional 
latent space and a D-dimensional observational space, 

z t ~ Norm (A • z t _i,Q), y t ~ Norm(C ■ z t ,R). 

We impose additional structure by assuming that the 
transition matrix A is a simple rotation with angular 
velocity u>, whereas the transition covariance Q is a 
diagonal matrix with constant coefficient q, 

. cos to — sin oj _ _ 

A = , Q = ql 2 . 

sinw cosw 

We simulate data with D = 36 dimensions, T = 100 
time points, oj = 47t/T, q = 0.1, a = 0.1, and r = 0.01. 
We now consider an inference setting where C and R 
are assumed known and estimate the state trajectory 
Zi : t, as well as the parameters of the transition model 
ui and q , which are given mildly informative priors, 
to ~ Gamma(10, 2.5) and q ~ Gamma(10,100). 

While this is a toy problem where an expectation max¬ 
imization (EM) algorithm could likely be derived, it 
is illustrative of the manner in which probabilistic 
programs can extend models by imposing additional 
structure, in this case the dependency of A on u>. This 













Particle Gibbs with Ancestor Sampling for Probabilistic Programs 


modified Kalman smoothing problem can be described 
in a small number of program lines 

[assume C [...]] ; assumed known 
[assume R [...]] ; assumed known 
[assume omega (* (sample (gamma-dist 10. 2.5)) 

(/ Pi T))] 

[assume A [[(cos omega) (* -1 (sin omega))] 

[(sin omega) (cos omega) ]]] 

[assume q (sample (gamma-dist 10. 100.))] 

[assume Q (* (eye 2) q)] 

[assume x 

(mem (lambda (t) 

(if (< t 1) 

[ 1 . 0 .] 

(sample (mvn-dist (mmul A (x (dec t))) W)))))] 
[observe (mvn (mmul C (x 1)) R) [...]] 
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Figure 2: Standard deviation of posterior estimates 
E[uj\ and E[q] over 25 independent restarts, as a func¬ 
tion of the PMCMC sweep. PGAS results are shown 
in cyan, ICSMC results are shown in blue, with shorter 
dashes indicating a larger particle count. 


[observe (mvn (mmul C (x 100)) R) [...]] 
[predict omega] 

[predict q] 


6 Discussion 


Here [. .. ] refers to a vector or matrix literal. 


We compare results for PGAS with 10 particles to IC¬ 
SMC with 10,20,50,100,200, and 500 particles. In 
each case we run 100 PMCMC sweeps and 25 restarts 
with different random seeds. To characterize mixing 
rates we calculate the effective sample size (ESS) of 
the aggregate sample set {w^’ l ,z^ 1 } over all sweeps 
s = {!>••• 1100}, 


ESS t 


100 L 




k \2 : 




5 — 1 1 = 1 


Here V t k represents the total importance weight asso¬ 
ciated with each unique value z k in {z®’ Z }. 

Figure [I] shows the ESS as a function of t. ICSMC 
shows a decreasing ESS as t approaches 0, indicating 
poor mixing for values sampled at early generations. 
PGAS, in contrast, exhibits an ESS that fluctuates but 
is otherwise independent of t. ESS estimates varied 
approximately 15% relative to the mean across inde¬ 
pendent restarts. This suggests that fluctuations in 
the ESS reflect variations in the prior probability of 
latent transitions p(zt | Zt-i, A). 


For this model, our PGAS implementation with 10 
particles has a computational cost per sweep com¬ 
parable to that of ICSMC with 300 particles. How¬ 
ever, when we consider the ESS per computation time, 
the increases in mixing efficiency outweigh increases in 
computational cost for state estimates below t ~ 50. 
This is further illustrated in Figure [2j which shows 
the standard deviation of sample estimates of the pa¬ 
rameters co and q. PGAS shows better convergence per 
sweep, particularly for estimates of q. ICSMC with 500 
particles performs similarly to PGAS with 10 particles 
when estimating u, though ICSMC with 500 particles 
notably has a higher cost per sweep. 


Relative to other PMCMC methods such as ICSMC, 
PGAS methods have qualitatively different mixing 
characteristics, particularly for variables sampled early 
in a program execution. Implementing PGAS in the 
context of probabilistic programs poses technical chal¬ 
lenges when programs can make use of recursion and 
memoization. The technique for traced evaluation de¬ 
veloped here incurs an additional computational over¬ 
head, but avoids unnecessary recomputation during re¬ 
generation. When the cost of recomputation is large 
this will result in computational gains relative to a 
naive implementation that re-executes the suffix fully. 
Note that our approach tracks upstream, not down¬ 
stream dependencies. In other words, we know what 
environment variables affect the value of a given ex¬ 
pression, but not which expressions in a suffix de¬ 
pend on a given variable. All referenced symbol values 
must therefore be checked during regeneration, which 
can require a 0(LN 2 ) computation in itself. Further 
gains could be obtained constructing a downstream de¬ 
pendency graph for the suffix, allowing more targeted 
regeneration via graph walk techniques analogous to 


those employed in Venture Mansinghka et al. 2014 


At the same time, the empirical results presented here 
are indicative of the fact that, even without these ad¬ 
ditional optimizations, PGAS can easily yield better 
statistical results in cases where the parameter space 
is large and ICSMC sampling fails to mix. 
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